ABSTRACT

Research into socio-technical systems like Wikipedia has 
overlooked important structural patterns in the coordination 
of distributed work. This paper argues for a conceptual
reorientation towards sequences as a fundamental unit of
analysis for understanding work routines in online knowledge
collaboration. We outline a research agenda for researchers in
computer-supported cooperative work (CSCW) to understand 
the relationships, patterns, antecedents, and consequences of 
sequential behavior using methods already developed in fields 
like bio-informatics. Using a data set of 37,515 revisions
from 16,616 unique editors to 93 Wikipedia articles as a case
study, we analyze the prevalence and significance of different
sequences of editing patterns. We illustrate the mixed-method
potential of sequence approaches by interpreting the
frequent patterns as general classes of behavioral motifs. We
conclude by discussing the methodological opportunities for
using sequence analysis to expand existing approaches to
analyzing and theorizing about co-production routines in
online knowledge collaboration.
INTRODUCTION

Once upon a time, organizational researchers trafficked in
platitudes that “X is an organization that works in practice,
but not in theory” where X might be Linux, Wikipedia,
GitHub, or many other “open” production systems. Far from
being boundary cases with little “real world” relevance,
these models have been replicated successfully across domains
outside of software and encyclopedias, prompting the
re-evaluation of many core economic, psychological, and
social frameworks for organizations. In particular, concepts
like “online knowledge collaboration” and “commons-based peer
production” define organizational models where distributed
members self-organize to create knowledge-based goods in the
absence of hierarchies or markets [10, 19].

While theories and methods from network science have been
brought to bear to understand the relationships that structure
online knowledge collaboration [3, 60], these predominantly
adopt a static view of collaboration. Conversely, studies
exploring the dynamics of how these collaborations emerge and
change over time often overlook the role of interaction
patterns in structuring these systems. To reconcile these two
streams of research, we argue that sequences should become a
fundamental unit of analysis for capturing how patterns of
interaction unfold over time into coherent structures that
capture co-production routines. Sequence analysis methods
have been widely used in biology and sociology for decades,
and the richness and scale of event log data available to HCI,
CSCW, and information systems researchers naturally lend
themselves to such sequence analysis. Beyond the quantitative
methods for enumerating and comparing sequence patterns,
these sequences likewise invite qualitative inquiry to
contextualize their prevalence and interpret their role in
structuring socio-technical behavior. Despite their mixed
methods lineage and potential, sequence methods have largely
been overlooked for understanding questions about
organizational routines within online knowledge collaborations.

Complex organizational practices are composed of more basic
behavioral patterns that are in many ways analogous to
the basic genetic sequences encoding complex proteins [25,
69]. Socio-technical systems like Wikipedia contain abundant
event logs encoding sequence data about the series of
revisions made to artifacts (i.e., articles) by individuals
(i.e., editors). The temporal dynamics of online knowledge
collaboration have received sustained attention across
scholarship in HCI and information systems. Broadly speaking,
this research has employed two primary approaches for
capturing system dynamics: (1) “aggregating” sequence logs
to represent a particular process, creating multiple
“snapshots” of the process over time [63], or (2) “collapsing”
sequence logs to calculate a pair-wise relationship between
entities (e.g., affiliation between users), often forming a
network of these entities and using social network analysis
(SNA) techniques to study the structure of the network
[37, 39, 30]. While our interest is in analyzing the sequences
of production and coordination activities that make up work
routines, this prior work aggregates and collapses event
sequences, thus obscuring the sequential ordering and
structures in longitudinal records of co-production
activities. In this paper we offer a general approach for
representing and analyzing event log data from peer production
systems that employs methods from sociology and
bio-informatics in order to track behavioral motifs and
identify the routines of collaborative work.

In the following section, we review prior work from sociology
and organizational theory around organizational routines
to understand the dynamics of social change, and contrast this
line of scholarship with research that has examined the
temporal dynamics of online peer production communities. A
research agenda employing sequences within socio-technical
systems as a fundamental unit of analysis has the potential to
expand theories and methods for both established CSCW topics
like recommendation systems and crowdsourcing as well
as emerging CSCW topics like online education, citizen
science, and crisis informatics.

In order to illustrate our proposed framework, we describe a
case study that explores co-production sequences. Building
on theoretical frameworks of generative routines [18], our
empirical analysis focuses on the sequences of contributors
collaborating to create a Wikipedia article. Using a
representative sample of 93 articles from English Wikipedia,
with 37,515 editing activities made by 16,616 distinct
contributors, this case explores whether co-production entails
a small cohort of editors who work closely together, or
whether, conversely, the production of content within
Wikipedia is built upon lengthy editing sessions by the same
editor. Our analysis of the behavioral sequences reveals
complex and non-random patterns and sheds light on the
prevalence of behavioral motifs (i.e., generative routines)
that organically emerge as contributors make choices about
when and how to engage in online knowledge collaboration.
Towards the end of this paper, we outline research questions
in these CSCW topics to be explored using sequence analysis
methods.


BACKGROUND AND REVIEW 

Sequence analysis shares a rich interdisciplinary literature
with psychology, economics, and sociology where it examines
patterns within temporally-ordered data, independent
variables that influence these patterns, and how these patterns
influence dependent variables. Critically, adjacent elements
of a sequence are typically not random, but have strong
dependencies reflecting topical, causal, or other latent
similarities [2]. Traditional sequence analyses may model the
step-by-step transitions between events or analyze the whole
sequence of events by measuring distances between them,
identifying unique sequences, or matching similar “families” of
sequences together. Sequence analysis methods have a rich
mixed methods history and have been used to model
dependencies in participation shifts during conversations [27],
to analyze reciprocated interactions on team performance [57],
and to suggest new ways to develop social theory [52].


We note that the use of these methods for understanding
organizational processes has primarily focused on production
and coordination in traditional co-located, synchronous, and
small-scale organizations, which are fundamentally
different from the context of the current investigation:
distributed, asynchronous, and large-scale online knowledge
collaborations. Our goal here is to bridge the gap between
methodological developments in sequence analysis and the
potential of large-scale event logs from online knowledge
collaborations to illuminate complex social processes.
This section reviews prior work on organizational routines
and their role in facilitating coordination, and connects it to
the research stream investigating temporal dynamics within
technology-mediated collaboration, as well as to the
theoretical and ethnographic work that aims at understanding
patterns of interaction in online knowledge collaboration.

Organizational routines 

Organizational routines are the primary means by which
organizations coordinate work [8]. Coordination entails managing
dependencies between activities [47]. Organization design
theory suggests that coordination can be achieved through
a variety of mechanisms [28, 62], but points to organizational
routines as the foundation of any work process that
involves coordination among multiple actors. Routines refer
to repeated patterns of behavior that are bound by rules
and customs and that do not change very much between
iterations [20]. Routines are the product of explicit attempts
to design work practices; they help retain organizational
history, but they can also be a source of inertia.

An alternative view emphasizes the emergent and dynamic
dimension of routines that are referred to using evolutionary
terms like “adaptation” or “mutation” [49]. Pentland and
colleagues [51, 54] used grammar as an analogy to explain
variation in routines: routines consist of rules that allow
people to select elements of a repertoire in order to construct
a particular sequence of behavior. Feldman [21] suggested that
newly-introduced sequences of action can also lead to changes
in rules and procedures. She introduced agency into the notion
of routine and viewed routines as improvisational (or
“performative”). This perspective highlights the internal
dynamic of routines, where continuous change is the result of
participants’ reactions to outcomes of previous iterations of
the routine. Thus, such routines are not merely repetitive but
are generative because they are refined and facilitate
knowledge co-production.

In an attempt to reconcile the two perspectives, Feldman and
Pentland [20, 53] present an ontology of organizational
routines distinguishing between: (a) the ostensive aspect of a
routine, which embodies what we typically think of as the
structure; and (b) the performative aspect, which embodies the
specific actions, by specific people, at specific times and
places, that bring the routine to life. The structural/ostensive
aspect enables people to account for specific performances of
a routine, while the performative/dynamic aspect creates,
maintains, and modifies those structural aspects. The
relationship between structural and dynamic aspects of routines
creates an on-going opportunity for variation, selection, and
retention of new practices and patterns of action within
routines. Therefore, routines allow organizations to achieve a
balance between adaptability and stability, affecting
organizations’ ability to adapt to changing circumstances.

Studies of organizational routines typically involve
ethnographic case studies, offering rich interpretations of
specific practices, but a limited capacity to more broadly
illuminate the structure and dynamics of routines and patterns.
Moreover, the sheer scale of digital trace data within
long-term and large-scale socio-technical systems’ event logs
complicates many traditional quantitative and qualitative
research methods [3, 45]. Developing a more general framework
requires a grammatical or lexical approach to explain the
diversity of routine elements and sequences across contexts
and organizations [25]. Such a “grammatical” approach
integrating in-depth field studies with archival digital trace
data would permit a mixed methodological approach to
triangulating across levels of analysis, cases, and time.

Online knowledge collaborations 

Online knowledge collaborations occur in the absence of
traditional organizational mechanisms like stable membership,
convergence, sustained interaction, or shared goals, but
instead rely upon fluid boundaries, norms, participants and
interactions [19]. Even so, empirical analyses of online
knowledge collaborations have largely adopted a static view of
motivations, roles, and routines, ignoring how work unfolds and
changes over time in relation to other events [13]. The
success of peer production systems primarily relying upon
distributed, volunteer labor can be understood by examining the
artifacts within these collaborations as the focal point where
small, self-motivated contributions gradually accumulate and
inspire others to contribute in kind [34]. Collaborators in
online knowledge collaborations negotiate tensions between
changing and retaining established artifacts within systems
by adopting specialized roles, shifting production foci, and
employing identifiable action patterns [38].

Online knowledge collaborations like Wikipedia and open
source development projects rely upon detailed records of
contribution and communication patterns. Many
socio-technical systems archive records and other meta-data
about changes in the state of the system into event logs. These
data are valuable for contributors to trace changes across
versions of documents [24], to evaluate others’ contributions
[61], and to build additional tools to support collaboration
[17]. Users’ contributions drive the development of artifacts
within the system, but these contributions also re-shape the
governance mechanisms among developers, portending new roles
and capacities within the online community [14]. This agency
versus structure tension manifests as a changing emphasis
between experimenting with new ideas and processes on the
one hand [48] and supporting stable organizational routines
and social roles on the other [23, 29]. Event log data capture
these complex interactions by encoding both the agency of
contributors to act across artifacts and respond to others’
contributions as well as recording events that structure the
community such as the development of rules.


Sequence analysis 

Despite calls to analyze the micro-level processes of peer
production systems [44] and despite the reliance on event logs
to understand how work processes and roles are structured
over time in prior research [29], these analyses have largely
ignored the role of detailed sequences in log data and
overlooked the temporal and artifact-related dependencies.
Prior research has investigated co-production dynamics by
either “aggregating” sequence logs to represent a particular
process, creating multiple “snapshots” of the process over
time, or “collapsing” sequence logs to calculate a pair-wise
relationship between entities, forming a network of these
entities and using social network analysis techniques to study
the structure of the network [37, 39, 30]. Less relevant to our
investigation are prior works that employed information
visualization techniques, rather than a formal knowledge
representation, to depict the dynamics of collaborative work
[18, 59, 65].

A few studies have gone further to provide a more detailed
account of temporal dynamics within peer production through
the combination of the two approaches highlighted above and
by generating a series of temporal snapshots of network
structure [31]. While these prior works capture some aspects of
the temporal dynamics underlying peer production, they
aggregate and collapse event sequences, rather than analyze
patterns within sequences of co-production routines. Another
relevant stream of research has analyzed patterns within
event sequences, but was restricted to a single phase
transition, for example in studying activity [67], contributor
[50] or role [7] transitions.¹ Despite the various approaches
for studying temporal dynamics in online production
communities [23, 31, 36], research in the area (in particular,
peer production) has not yet developed approaches and methods
for identifying patterns of co-production activities from
event log sequences [25]. A few recent studies have attempted
to analyze multi-stage transitions from event log data [26] and
have explored the application of methods borrowed from
bio-informatics [4, 25, 69]. Nonetheless, these studies have
employed simple knowledge representation schemes, often with
an event of a single type.

Our goal in this paper is to extend existing approaches to
sequence analysis by proposing a general framework for
analyzing multi-stage and multi-type co-production routines, by
employing event log sequences and relying on methods common
in bio-informatics. Because event logs within online
knowledge collaborations encode meta-data about the who,
what, when, where, and how of socio-technical activities, an
analysis of sequence logs could reveal the routines within
artifact coordination and co-production. As we detail in the
subsequent section, such event logs could allow researchers
to enrich the representation of behavioral changes and
organizational structure at multiple levels of analysis while
using a mixture of quantitative and qualitative methods. Much
as computational methods from natural language processing
or social network analysis enable researchers to extract new
features from a corpus of unstructured data, we imagine
sequence analysis becoming a similar methodological toolkit

¹ The restriction to a single phase transition is often associated with
the analytic approaches employed (e.g., Markov models).
Activity                Artifact    Performer   Order
(Edit type)             (Article)   (Editor)    (Time)

Rephrase Text (RT)      X           U1          1:01
Change Markup (CM)      X           U2          2:02
Add Content (AC)        X           U3          3:03
Fix Typos (FT)          X           U1          4:04
Add Vandalism (AV)      X           U4          5:05
Delete Vandalism (DV)   X           U1          6:06

Table 1. Example of an event log.
Figure 1. The event log in Table 1 can be represented as a sequence of
activities (top), artifacts (middle), or performers (bottom).

for researchers to interrogate data. In the remainder of the
paper, we use a case study to explore the potential of
quantitatively mining and qualitatively interpreting sequence
data derived from socio-technical event logs. Generalizing from
this case study, we outline a proposed research agenda for
employing large-scale event log data to understand the
relationships, patterns, predictors, and consequences of
sequence data across various organizational phenomena. Finally,
we discuss the implications this research agenda has for
extending sequence methods to empirically study and theorize
routines in technology-mediated collaborations.

SEQUENCE ANALYSIS METHODOLOGY FRAMEWORK 

The availability of detailed sequential event log data in
large-scale socio-technical systems provides substantial
opportunities to extend interdisciplinary empirical research
around routines in new organizational forms. Understanding the
distribution, evolution, and generative role of activity
sequences within these systems can improve theories about
online collective action, distributed work, and shared
governance. Socio-technical systems like Wikipedia record the
complete history of every change made to an article since
the first edit as well as every revision made by any user.
Analogous event logs likewise exist for open source
repositories containing histories of the commits from every
user to a project, and social networking sites archiving the
history of actions and posts users have made. These event logs
contain distinct records encoding a variety of meta-data
related to production (e.g., developing an encyclopedia
article), coordination (e.g., discussion around who should do
what), user interactions (e.g., comments by one user on another
user’s personal page), and governance (e.g., policy
enforcement).

Given the diverse affordances of socio-technical systems,
we will bracket our proposed framework by focusing on
the modality of users collaborating to create artifacts (e.g.,
Wikipedia), although alternatives such as users managing
relationships to facilitate interpersonal communication (e.g.,
Twitter) certainly exist. Building on prior work in “process
mining” that extracts information from event logs to
understand how organizations manage business processes [64],
we employ four event log features to construct sequences that
capture organizational routines [63]:

Activity - A system action such as a revision to an article. 
Artifact - An entity which is acted upon such as an article. 
Performer - An entity executing the activity such as a user. 
Order - An index defining a sequence such as a timestamp. 

A simple example of an event log is given in Table 1
for contributions (“commits”) from a variety of users
{U1, U2, U3, U4} to a single artifact X. As a generalizable
framework, we define activities, artifacts, performers, and
order broadly. Activities are not merely edits or commits,
but also include the granting or revocation of permissions,
the creation of new relationships, or instances of
communication between users. Artifacts are not merely
information-based products, but may also refer to coordination
elements of various types like discussion or email threads or
governance repositories logging the history of permission,
promotion, or enforcement actions. Performers are not only
single human users, but potentially sub-users within an account
(e.g., characters in an online game) or automated users (e.g.,
anti-vandalism bots on Wikipedia). Alternative frameworks
emphasizing interpersonal communication and relationships
might employ ego-to-alter elements as opposed to the
user-to-artifact constructs as well.

Step 1: Identification of sequences 

This event log of “users taking actions on an object” encodes
at least three kinds of sequences from Table 1: (1) the
sequence of performers (e.g., users) who acted
(U1-U2-U3-U1-U4-U1), (2) the sequence of activities (e.g.,
commits) that were taken (RT-CM-AC-FT-AV-DV), and (3) the
sequence of artifacts (e.g., articles from the set A') that
were acted upon (X-X-X-X-X-X). Each of these three sequence
types is trivial when looking at the event log of a
single performer, the event log of a single activity, and the
event log of a single artifact, respectively. For example,
Table 1 represents a simple case of an event log related to a
single artifact, where there is meaningful variation in the
sequence of activities (edit types) and of performers (users),
but no variation in artifacts (articles). Crucially, these three
sequence types have relationships with each other and may
also re-occur at other times within the same artifact’s event
log as well as in other artifacts’ event logs. Figure 1
describes the three types of sequences that could be derived
from the example in Table 1.
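
The extraction of these three sequences can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in the study: the `Event` record, the `extract_sequences` helper, and the choice to sort on the "H:MM" order field are all assumptions introduced here, with the rows taken from Table 1.

```python
from typing import NamedTuple


class Event(NamedTuple):
    activity: str   # edit type, e.g. "RT" for Rephrase Text
    artifact: str   # article acted upon
    performer: str  # editor who acted
    order: str      # "H:MM" time used as the ordering index


# The six rows of Table 1.
EVENT_LOG = [
    Event("RT", "X", "U1", "1:01"),
    Event("CM", "X", "U2", "2:02"),
    Event("AC", "X", "U3", "3:03"),
    Event("FT", "X", "U1", "4:04"),
    Event("AV", "X", "U4", "5:05"),
    Event("DV", "X", "U1", "6:06"),
]


def order_key(event):
    """Turn an "H:MM" string into a sortable (hours, minutes) tuple."""
    hours, minutes = event.order.split(":")
    return int(hours), int(minutes)


def extract_sequences(log):
    """Return the performer, activity, and artifact sequences of a log."""
    ordered = sorted(log, key=order_key)
    return {
        "performers": [e.performer for e in ordered],
        "activities": [e.activity for e in ordered],
        "artifacts": [e.artifact for e in ordered],
    }


sequences = extract_sequences(EVENT_LOG)
print("-".join(sequences["performers"]))  # U1-U2-U3-U1-U4-U1
print("-".join(sequences["activities"]))  # RT-CM-AC-FT-AV-DV
print("-".join(sequences["artifacts"]))   # X-X-X-X-X-X
```

Keeping the four features in one record makes it easy to project the same log onto whichever sequence a research question calls for.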

Step 2: Schematization of sequences 

Once the fundamental elements representing online knowledge
production have been defined, there is a need to formulate
a knowledge representation scheme that would capture
the particular sequences of elements that are most relevant
for the specific research question at hand. For example,
questions pertaining to personal relationships between
participants and their social networks require that the
representation capture participants’ identities. In contrast,
research questions around the temporal evolution of the
community’s organizational structure call for representing
participants’ roles and the timing of role-transition events
(but not their identities). We define our knowledge
representation scheme in Table 2 with an example provided in
Figure 2.

Step 3: Analysis of sequences 

The third step in our proposed framework calls for the
analysis of the extracted sequences. Researchers in
bioinformatics, natural language processing, and sociology
have independently developed and extended quantitative methods
for analyzing similarities and variation patterns in sequence
data that could be employed to analyze sequential behavior
across organizations [25, 69]. Methods in
this vein focus on three classes of analysis:

Pattern mining. Using enumerative and deterministic
optimization methods to discover the types and frequencies of
different sequences within a data set [15]. The number of
sequences and tests for their likelihood against some
baseline expectation can reveal important patterns [32].

Sequence similarity. A sequence’s similarity to other
sequences can be evaluated based on notions of proximity,
where more deviations in the elements of one sequence
compared to the other imply a larger “distance” [16, 66].
These might be represented as a “phylogenetic tree” clustering
similar sequences together or use “dynamic time warping” to
represent continuous time series as a sequence of discrete
tokens, like letters [11, 46].

Probabilistic analysis. A sequence can also be the
realization of an underlying process that might generate
alternative sequences as well [43]. Dynamic programming
methods and Hidden Markov Models (despite their limitations)
analyze sequences to understand the probability of the
observed patterns as a result of transitions between finite
states [56].

Each of these methods reflects different modeling
assumptions. While developed for domains other than online peer
production, these tools can generally be applied to sequence
data of any kind with substantial potential for generating
novel metaphors for understanding complex behavior [25].
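
To make the three classes concrete, the following Python sketch gives one deliberately simple instance of each: n-gram counting for pattern mining, Levenshtein edit distance (a dynamic programming method) for sequence similarity, and a first-order Markov transition estimate for probabilistic analysis. The function names are illustrative assumptions and do not come from the cited work or from any particular library; the Markov estimate only models single-step transitions and is a much simpler stand-in for a Hidden Markov Model.

```python
from collections import Counter


def count_ngrams(sequences, n):
    """Pattern mining: count every contiguous subsequence of length n."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def edit_distance(a, b):
    """Sequence similarity: Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def transition_probabilities(sequences):
    """Probabilistic analysis: first-order Markov transition estimates."""
    counts = {}
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts.setdefault(current, Counter())[following] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


example = "EEECEB"  # a hypothetical sequence of single-letter event codes
print(count_ngrams([example], 2)[("E", "E")])    # 2
print(edit_distance(example, "EEEEEB"))          # 1
print(transition_probabilities([example])["E"])  # {'E': 0.5, 'C': 0.25, 'B': 0.25}
```

Comparing the observed n-gram counts against counts from shuffled sequences would be one way to build the baseline expectation mentioned above.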

Step 4: Interpretation of sequences 

The final step of our sequence analysis methodology
framework is to employ qualitative methods to interpret and
triangulate the observed patterns. Such qualitative analysis
may rely on secondary and archival data from the online
community, interviews with community members [12], or a field
study [36]. Employing qualitative methods allows for
enriching the results obtained through quantitative sequence
analysis with a deep understanding of the contextual factors
contributing to the observed patterns. Such a mixed method
approach can yield richer insights regarding the motivations
and contexts for co-production routines.


E -> E -> E -> C -> E -> B

1:01 2:02 3:03 4:04 5:05 6:06

Figure 2. The sequence from the example event log in Figure 1 (top) can
be schematized using the rules from Table 2 (bottom).
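
The schematization shown in Figure 2 can be reproduced with a short Python sketch. It is a minimal reading of the Table 2 rules under stated assumptions: the gap is counted in editing sessions since the same contributor's most recent session in the article, a gap of six or more sessions is treated as label D (Table 2 states ">6"), and label F relies on a hypothetical `new_to_wikipedia` set of contributor IDs, since that information is not contained in a single article's log.

```python
def schematize(performers, new_to_wikipedia=frozenset()):
    """Map a performer sequence onto the A-F labels of Table 2."""
    last_seen = {}  # contributor -> index of their most recent session
    labels = []
    for index, who in enumerate(performers):
        if who not in last_seen:
            # First session in this article (F if also first in Wikipedia).
            labels.append("F" if who in new_to_wikipedia else "E")
        else:
            gap = index - last_seen[who]
            if gap == 1:
                labels.append("A")  # also made the previous session
            elif gap == 2:
                labels.append("B")  # session preceding the last one
            elif gap <= 5:
                labels.append("C")  # 3-5 sessions prior
            else:
                labels.append("D")  # assumption: 6 or more sessions prior
        last_seen[who] = index
    return labels


performers = ["U1", "U2", "U3", "U1", "U4", "U1"]  # from Table 1
print("-".join(schematize(performers)))  # E-E-E-C-E-B
```

Because the labels depend only on relative position, two articles edited by entirely different people can yield the same schematized sequence.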

CASE STUDY 

The focus of this empirical investigation is the English
Wikipedia and its community of editors. In contrast to the
organization of work within traditional organizations, in the
creation of a Wikipedia article each contributor is free to
generatively enact a role in the moment [19]. Thus, contributors
choose not only when and if to participate in the co-production
process, they are also free to determine the exact nature of
their activity (with very few restrictions, e.g., a
non-registered member is not allowed to create a new Wikipedia
article). The goal of this case study is to explore a
particular aspect of co-production routines: the prevalence of
different sequences of editing patterns, using the
contributor’s ID as the focal point. Such an analysis has the
potential to yield insights regarding the number of
contributors participating in article co-creation and the
sequences of their product-centric interaction (e.g.,
discovering “ping-pongs” between a few contributors working
collaboratively to improve the article).

Sample and Data Extraction 

Our sample included 93 articles that provide a representation
of Wikipedia’s topical categories; this sample has
been employed in earlier studies [6, 5]. The set of articles
was selected based on randomization and a stratified sampling
of Wikipedia’s topics, congruent with Wikipedia’s top-level
classification [41] (categories: culture, art, and religion;
math, science, and technology; geography and places; people
and self; society; and history and events). From this original
set of 96 Wikipedia articles, 3 were discontinued and their
edit history is no longer available, leaving us with 93
articles. We tracked all activities to this sample of articles, from
each article’s inception until December 31, 2012. Altogether, 
this data set includes 37,515 editing activities made by 16,616 
distinct contributors. 

Our investigation is primarily based on a large-scale
quantitative analysis of Wikipedia system activity logs, which
is complemented with a qualitative analysis. While the
quantitative analysis is essential for identifying those
patterns that recur frequently and for drawing statistical
conclusions, the qualitative analysis allows for a deeper
understanding as to why and how co-production unfolds. For the
quantitative analyses, we employed data harvested from
Wikipedia logs. We queried the Wikipedia API, recording the
time and contributor of each edit made to articles in our
sample. In addition, we tracked the activity of all
contributors who have edited articles in our sample, recording
key events in contributors’ Wikipedia career (e.g., the date of
their first edit to Wikipedia). For each editing activity in
each of the articles in our sample, we recorded the
contributor’s ID and the timing
Label  Description                                Count   %    Size  Duration  Last session  Last same-session

A      The contributor also made the previous      2,228   8%   705   3.15      2.00          2.00
       editing session in this article
B      The contributor was active in the editing   1,835   6%   1447  0.89      1.71          4.8
       session preceding the last session
C      The contributor was active in this article  1,350   5%   623   0.94      4.61          26.2
       3-5 editing sessions prior
D      The contributor was active in this article  3,944   14%  812   0.75      4.64          192.4
       >6 editing session prior
E      The first editing session of the            19,562  68%  553   0.75      11.66         n/a
       contributor to this article
F      The first editing session of the            4       0%   429   1.25      6.25          n/a
       contributor in Wikipedia

Table 2. Coding system. Average session duration is measured in hours; average time since last session and average time since the same user’s last session
are measured in days.


of activities. Each article was then represented as a vector of 
temporally-adjacent sequential editing activities. 

In addition, qualitative analysis was used in order to gain a 
deeper understanding of a few recurring patterns as examples. 
For this analysis, we employed a combination of manual and 
automated methods, relying on both Wikipedia (e.g., using 
the version compare tool to manually determine the nature 
of the particular activity) and external resources (e.g., Web 
activity related to the topic of a particular Wikipedia article, 
as a means of identifying real-world events that may have 
triggered Wikipedia editing activity). 

Data representation 

In order to analyze sequences of activities, it was essential
that we devise a consistent way for representing: (a) the
fundamental unit of activity; and (b) a sequence of activities.
We chose an editing session as the basic unit of analysis,
collapsing sequential edits made by the same contributor within
a very short period of time [26]. Often, contributors would save
their edit-in-process, where each saved version is recorded in
the Wikipedia log as a revision of the article.

We sought a more meaningful unit of analysis, and collapsed
sequential edits within 10 minutes into a single edit
session. The 10-minute threshold was determined based on a
statistical analysis of time-between-edits. The time stamp
of the collapsed editing session was defined as the interval
between the first and last edit activities included in that
session. Consider the example of a series of edits made by
user U1: U1(10:10)-U1(10:12)-U1(10:21); this edit activity
sequence will be collapsed into the editing session
U1([10:10,10:21]). In investigating recurring patterns in
sequences, we process all articles in the data set together.
After collapsing all edit activity vectors, 28,923 editing
sessions remained.
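
The collapsing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the 10-minute threshold applies to the gap between successive edits by the same contributor (a reading consistent with the 10:10, 10:12, 10:21 example), and the names `collapse_sessions` and `at` are hypothetical.

```python
from datetime import datetime, timedelta

def collapse_sessions(edits, max_gap=timedelta(minutes=10)):
    """Collapse consecutive edits by the same contributor into sessions.

    `edits` is a time-ordered list of (contributor_id, timestamp) pairs for
    one article. Returns a list of (contributor_id, start, end) sessions.
    Assumption: an edit joins the open session when it is made by the same
    contributor within `max_gap` of that session's latest edit.
    """
    sessions = []
    for user, ts in edits:
        if sessions:
            last_user, start, end = sessions[-1]
            if last_user == user and ts - end <= max_gap:
                sessions[-1] = (user, start, ts)  # extend the open session
                continue
        sessions.append((user, ts, ts))  # open a new single-edit session
    return sessions

def at(hhmm):
    # Hypothetical helper: a timestamp on an arbitrary fixed day.
    return datetime.strptime("2010-01-01 " + hhmm, "%Y-%m-%d %H:%M")

edits = [("U1", at("10:10")), ("U1", at("10:12")), ("U1", at("10:21")),
         ("U2", at("10:25")), ("U1", at("10:40"))]
sessions = collapse_sessions(edits)
```

Here the first three edits become one session spanning 10:10 to 10:21, while the later edit by U1 starts a new session because another contributor edited in between.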

In order to represent sequences of activities (i.e., motifs), it
was essential that we devise a novel representation of
contributor sequence motifs. Given our focus on the sequences
of contributors co-creating the articles, we needed to design
a generic motif representation that could represent similar
sequences by different contributors. For example, consider the
set of four contributors {U1, U2, U3, U4}: we wanted to
create a single motif to capture the following similar sequences:

U1-U1-U2-U2-U3
U2-U2-U4-U4-U1
U3-U3-U1-U1-U2

These all represent two consecutive editing sessions by the same
contributor, followed by two consecutive sessions by a
different contributor, then followed by an editing session by a third
contributor.

The questions our representation tries to address are: (1) Who
is the person making the most recent edit? (2) Is it the same
person who made the prior edit? (3) Did that person make an
earlier edit to the same article? (4) Or does this edit represent
the contributor’s first activity in the particular article? While
prior works have proposed various approaches for representing
co-production motifs in Wikipedia, these motif representations
were not appropriate for this more general purpose. Our proposed
motif representation traverses the editing session vectors for
each article in our sample, starting from the earliest session,
and for each editing session records the most recent time that
same contributor was active, as seen in Table 2.
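
A minimal sketch of this labeling pass is given below, under stated assumptions. The function name `label_sessions` and the `wikipedia_newcomers` parameter are hypothetical; label F needs site-wide history that a single article vector does not contain, so it is supplied from outside. The sketch also assumes that a distance of exactly six sessions falls under label D, which Table 2 does not state explicitly.

```python
def label_sessions(contributors, wikipedia_newcomers=frozenset()):
    """Label each session by how recently its contributor was last active.

    `contributors` is the time-ordered list of contributor IDs for one
    article's editing sessions. `wikipedia_newcomers` (assumed input) holds
    IDs whose first-ever Wikipedia session occurs in this article.
    """
    last_index = {}
    labels = []
    for i, user in enumerate(contributors):
        if user not in last_index:
            labels.append("F" if user in wikipedia_newcomers else "E")
        else:
            distance = i - last_index[user]  # sessions since last activity
            if distance == 1:
                labels.append("A")
            elif distance == 2:
                labels.append("B")
            elif distance <= 5:
                labels.append("C")
            else:
                labels.append("D")  # assumption: 6 or more sessions prior
        last_index[user] = i
    return labels

# The three example sequences differ in who edits, but share one motif.
examples = [
    ["U1", "U1", "U2", "U2", "U3"],
    ["U2", "U2", "U4", "U4", "U1"],
    ["U3", "U3", "U1", "U1", "U2"],
]
labelled = [label_sessions(seq) for seq in examples]
```

Because the labels depend only on relative recency, contributor identities drop out and similar sequences by different contributors map to the same label string.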

Distribution of sequences 

We begin by presenting descriptive statistics for motif labels
(see Table 2). We notice that the vast majority of sessions
(68%) represent a contributor’s first-time activity in the
particular article (label E), followed by labels representing a
non-recent activity by that contributor (D; 14%). The analysis
of revision size shows that revisions by the contributor who
made the edit session before last (B) are the largest (roughly
1500 characters; possibly indicating edit wars, see details
below), while those making their first edit to Wikipedia (F) tend
to make smaller changes. Consecutive edit sessions by the
same contributor (A) are of a longer duration, indicating that
they are often composed of several edits. An analysis of the
time between editing sessions reveals that contributors new
to the article (E) tend to edit after long dormant periods (on
average, almost 12 days), while contributors who made the edit
session before last (B) often react quickly to the previous
session (on average, under 2 days; representing ping-pongs between
contributors). Finally, an analysis of the times between
same-contributor sessions shows that inactive contributors (D)
often stay away from editing an article for prolonged periods
(on average, over six months).

Table 3 below presents the most frequent motifs in lengths of
2-4 labels (see footnote 2). As can be seen from these results,
the vast majority of motifs represent sequences of activities by
editors new to the article (label E), sometimes interrupted by an
editing session by a contributor that was active a while ago (D).
This is not surprising, given that 68% of single-label motifs
represent newcomers (E) and sessions by inactive contributors
(D) are second most frequent.
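
Counting motifs of a given length amounts to sliding a fixed-size window over each article's label vector. The sketch below illustrates this with made-up label vectors; `count_motifs` is a hypothetical name, and the assumption that windows never span two articles is ours.

```python
from collections import Counter

def count_motifs(label_vectors, length):
    """Count every contiguous run of `length` labels across all articles."""
    counts = Counter()
    for labels in label_vectors:
        # Windows are taken within one article only (assumption).
        for i in range(len(labels) - length + 1):
            counts[tuple(labels[i:i + length])] += 1
    return counts

# Illustrative label vectors for two articles (not data from the paper).
vectors = [list("EEDEE"), list("EAAE")]
pairs = count_motifs(vectors, 2)
total_pairs = sum(pairs.values())
most_frequent = pairs.most_common(1)
```

Dividing each count by `total_pairs` would give the kind of fraction reported alongside the counts.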

A more telling statistic identifies those motifs that appear
substantially more (or less) often than expected by
chance alone. Assuming that the appearance of a label is
independent of the appearance of the previous labels, this is
calculated independently for each motif length. That is, the
calculation of by-chance occurrence assumes there is no
dependency between sequential labels, such that the expected
frequency of every motif is simply the product of the
frequencies of all its labels. For example, the expected
frequency of the motif B-C-A is 6% x 5% x 8% = 0.024%.
Our analyses identified those motifs that appear significantly
above (or below) chance, using a two-tailed Z-test. We
focus our analyses on the patterns that reoccur frequently in our
sample: 2-motif sequences reoccurring at least 200 times;
3-motif sequences with more than 100 occurrences; and 4-motif
sequences with more than 50 occurrences. When calculating
statistical significance levels for the Z-test, we apply the
Bonferroni correction: a statistical method for counteracting the
problem of multiple comparisons (family-wise error rate) [9].
The Bonferroni correction is very conservative, and comes
at the cost of increasing the probability of producing false
negatives, and consequently reducing statistical power. Thus,
statistical significance after applying this correction suggests
that the null hypothesis could be rejected with very high
confidence.
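
The calculation can be sketched as below. The text does not give the exact test statistic, so this sketch assumes a one-sample proportion Z-test against the independence baseline; all function names are hypothetical. The window count of 28,827 is also an assumption (28,923 sessions minus one per each of the 96 articles).

```python
import math

def expected_fraction(motif, label_freq):
    """Expected motif frequency if labels were independent of each other."""
    p = 1.0
    for label in motif:
        p *= label_freq[label]
    return p

def z_score(observed_count, n_windows, p_expected):
    """One-sample proportion Z statistic (assumed form of the test)."""
    p_observed = observed_count / n_windows
    std_err = math.sqrt(p_expected * (1 - p_expected) / n_windows)
    return (p_observed - p_expected) / std_err

def two_tailed_p(z):
    """Two-tailed p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_significant(p_value, n_tests, alpha=0.001):
    """Compare against alpha divided by the number of motifs tested."""
    return p_value < alpha / n_tests

# Worked example from the text: B-C-A with 6%, 5% and 8% label frequencies.
p_bca = expected_fraction("BCA", {"A": 0.08, "B": 0.06, "C": 0.05})

# A-A motif, using the A count from Table 2 and the A-A count from Table 5.
p_aa = expected_fraction("AA", {"A": 2228 / 28923})
z_aa = z_score(839, 28827, p_aa)
significant = bonferroni_significant(two_tailed_p(z_aa), n_tests=100)
```

Under these assumptions `p_bca` reproduces the 0.024% of the worked example, and `z_aa` comes out near the 51.2 reported for A-A in Table 4. That is consistent with a test of this form, but it does not confirm that this is the exact procedure used. The `n_tests=100` value is purely illustrative.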

Table 4 below presents the most significant motifs (highest
absolute Z-scores) of 2-4 labels (the frequency of all
motifs in this table is significantly above/below chance at
p < 0.001, with Bonferroni correction). As Table 4 reveals,
those motifs that appear significantly above/below chance
are quite different from the most frequent motifs described
in Table 3. We note that while a few of the significant motifs
are very common (namely sequences of E), the other
statistically significant motifs are less common. For example,
same-contributor motifs (label A) appear significantly more
than what is expected by chance alone: the motif A-A-A is
expected to occur in 0.05% of 3-label motifs, while it appears
1.69% of the time; Z-score = 131 (p < 0.001).


2 We analyzed motifs of up to 10-label length, and in essence the 
patterns observed for these longer motifs were quite similar to the 
shorter motifs presented in the tables. Because of space limitations, 
we decided to exclude these longer motifs from this paper. 


1st label  2nd label  3rd label  4th label  Count   Fraction
E          E                                14,003  48.6%
E          D                                2,789   9.7%
D          E                                2,686   9.3%
E          B                                1,014   3.5%
B          E                                1,003   3.5%
E          E          E                     10,469  36.4%
E          D          E                     2,024   7.0%
E          E          D                     1,873   6.1%
D          E          E                     1,742   6.1%
E          E          E          E          7,973   27.8%
E          E          E          D          1,397   4.9%
E          E          D          E          1,383   4.8%
E          D          E          E          1,348   4.7%
D          E          E          E          1,227   4.3%


Table 3. Most frequent sub-sequences. 


Label 1  Label 2  Label 3  Label 4  Expected  Observed  Z-Score
A        A                          0.59%     2.91%     51.2
B        B                          0.40%     1.36%     25.7
E        A                          5.21%     2.83%     -18.2
A        E                          5.21%     3.48%     -13.2
E        E                          45.74%    48.57%    9.6
A        A        A                 0.05%     1.69%     130.8
B        B        B                 0.03%     0.57%     58.2
E        E        E                 30.94%    36.42%    20.1
C        B        B                 0.02%     0.18%     19.6
B        A        A                 0.04%     0.25%     18.6
A        A        A        A        0.00%     1.13%     323.5
B        B        B        B        0.00%     0.29%     121.9
B        A        A        A        0.00%     0.14%     41.9
A        A        A        E        0.03%     0.43%     38.0
C        A        A        A        0.00%     0.07%     35.2


Table 4. Most frequent sub-sequences and Z-scores. All sequences in 
tables appear significantly above or below chance at p < 0.001 (with 
Bonferroni correction). 


Behavioral motifs 

A synthesis of the results presented above points to the
emergence of several typical motifs, which shed light on the
nature of co-contributor sequences in Wikipedia. In trying to
interpret the meaning of the emerging patterns, we rely on
qualitative analysis. Below, we describe the key patterns that
emerge and explain their implications for our understanding
of generative routines within peer-production.

Solo contributor motifs 

Much of the production work in Wikipedia is the result of a
series of consecutive editing sessions made by the same
contributor. Although same-user editing sessions (label A) make
up only 8% of all labels, many of the multi-label motifs
include a series of A labels. Commonly an A editing session is
the result of the collapse of several edits (on average spanning
3.15 minutes). It is interesting to note that even after
collapsing immediate same-contributor edits into editing sessions,
we still observe in Table 5 an above-chance occurrence of
sequences of label A. Often, A label editing sessions represent
a quick reaction to a previous editing session (average time
between sessions is 2 days).

Label 1  Label 2  Label 3  Label 4  Example            Count  Z-Score
A        A                          U1-U1-U1           839    51.2
A        A        A                 U1-U1-U1-U1        487    130.8
A        A        E                 U1-U1-U1-U2        245    12.1
A        A        A        A        U1-U1-U1-U1-U1     325    323.5
A        A        A        E        U1-U1-U1-U1-U2     122    38.0

Table 5. Uninterrupted same-contributor sequences, counts, and
Z-scores. All sequences in tables appear significantly above or below
chance at p < 0.001 (with Bonferroni correction).

Label 1  Label 2  Label 3  Label 4  Example            Count  Z-Score
B        B                          U1-U2-U1-U2        393    25.7
B        B        B                 U1-U2-U1-U2-U1     165    58.2
B        B        E                 U1-U2-U1-U2-U3     167    10.1
E        B        B                 U2-U1-U2-U1        136    6.5
B        B        B        B        U1-U2-U1-U2-U1-U2  83     121.9
B        B        B        E        U1-U2-U1-U2-U1-U3  53     21.6
B        B        E        E        U1-U2-U1-U2-U3-U4  100    6.5

Table 6. Reactive contributor sequences, counts, and Z-scores. All
sequences in tables appear significantly above or below chance at
p < 0.001 (with Bonferroni correction).

The literature on peer-production has paid little attention to
the nature of co-contributor sequences. A recent work on
open source software development projects suggests that (a)
the overwhelming majority of work is accomplished with
only a single programmer working on a task and (b) when
tasks appear too large for an individual they are more likely to
be deferred until they are easier, rather than being undertaken
through structured teamwork [34]. Our findings suggest that
Wikipedia, too, may rely on single-contributor work
sessions (as opposed to multi-contributor teamwork). This result
stands in contrast to earlier accounts of Wikipedia which
depicted co-production as a tightly networked co-authorship.

Reactive contributing motifs 

The second interesting pattern we observe represents
back-and-forth editing sessions (“ping-pongs”) between active
contributors. Two-contributor ping-pongs may occur in
immediate sequence (e.g., B-B motifs) or may be interrupted
by a third user’s edit (e.g., B-C-B-B), and are characterized
by a short duration between editing sessions (1.7 days for the B
label, compared to the 9-day average across all labels). As
illustrated in Table 6, such motifs occur significantly more than
what is expected by chance alone.

Remarkably, despite the underlying assumption that
large-scale technology-mediated online collaborations are the
result of multiple contributors working together, the literature
on Wikipedia (and more broadly, studies of online production
communities) has rarely provided empirical accounts
of co-production work that involves a restricted set of
individuals working together. Interestingly, the one form of
ping-pong that has been discussed in prior studies concerns
destructive editing activities (e.g., edit wars between vandals
and vandalism fighters) [41, 65, 68].

Label 1  Label 2  Label 3  Label 4  Example      Count   Z-Score
E        E                          U1-U2        14,003  9.6
E        E        E                 U1-U2-U3     10,469  20.1
E        E        E        E        U1-U2-U3-U4  7,973   28.7

Table 7. Common inactive contributor sequences, counts, and Z-scores.
All sequences in tables appear significantly above or below chance at
p < 0.001 (with Bonferroni correction).

A qualitative analysis of a selected set of “ping-pong”
motifs shows that many of those motifs correspond to edit wars
(in almost all cases, the vandal was a non-registered member,
identified only by an IP address). For example, an analysis
of the “Flying Car” article found that active-contributor
sequences entailed vandalism-revert edit wars in August 2006
(4 vandalism edits, each followed by a revert), June 2007
(3 vandalism-revert pairs), and May 2008
(vandalism-revert-vandalism-revert). Label B is characteristic of
these ping-pongs and reflects a large average edit (over 1400
characters; more than double the average size of all labels),
possibly reflecting ‘revert’ type edits (which are the size of the
entire article content). Nonetheless, we did also observe
few-and-active contributor sequences that represent consecutive
sessions of constructive activity. For example, we observed
multiple periods of constructive ping-pongs for the “Orange
Revolution” article: in February 2006, August 2006 and
January 2010 (a series of sequential revisions by two
contributors, where one contributor adds new content and the other
reorganizes content on the page).

Inactive contributor motifs 

The third pattern emerging from our analysis is that those
editing the article for the first time tend to arrive consecutively.
Moreover, editing sessions represented by the label E often
correspond to a single edit: their editing sessions are not the
result of collapsing multiple edits (their average duration is
0.75 minutes). Despite the high expected appearance for label
E (comprising close to 70% of all labels), motifs representing
a series of E labels appear significantly more than anticipated
by chance alone. Table 7 describes those sequences
appearing significantly more than what is expected by chance alone,
while Table 8 presents examples of the reverse effect (new
editors on an article remaining engaged), showing that it
appears significantly below chance.

We offer two alternative interpretations for these sequences.
First, we propose that sequences of first-time editors may
represent dormant periods where “owners” do not attend to
the article. This is implied by the lengthy times between the
preceding editing session and the label E session (over 11
days). Similarly, the duration leading to a first-time Wikipedia
edit (label F) is also long, over 6 days. Our second
explanation suggests that there are sudden shocks (events internal
or external to Wikipedia) that attract editing activity by
first-time contributors. Earlier accounts of Wikipedia have
demonstrated that articles about breaking news incidents
exhibit high-tempo coordination dynamics, with structures and
dynamics distinct from those observed among articles about
non-breaking events [40].

Label 1  Label 2  Label 3  Label 4  Example                                  Count  Z-Score
E        A                          U1-U1                                    815    -18.2
E        A        E                 U1-U1-U2                                 493    -16.6
E        E        A                 U1-U2-U2                                 565    -14.3
D        E        A                 U1-U3-U4-U3-U3-U4-U1-U2-U2               107    -6.8
A        E        A                 U1-U1-U2-U2                              58     -5.4
E        A        E        E        U1-U1-U2-U3                              336    -13.4
E        E        A        E        U1-U2-U2-U3                              356    -12.7
E        E        E        A        U1-U2-U3-U3                              431    -9.8
D        E        E        A        U1-(5 sessions by others)-U1-U2-U3-U3    56     -7.0
D        E        A        E        U1-(5 sessions by others)-U1-U2-U2-U3    61     -6.6

Table 8. New-comer contributor non-sequential sequences, counts, and Z-scores. All sequences in tables appear significantly above or below chance at
p < 0.001 (with Bonferroni correction).

Label 1  Label 2  Label 3  Example         Count  Z-Score
A        E                 U1-U1-U2        1,003  -13.2
B        E                 U1-U2-U1-U3     986    -7.3
C        E                 U1-U3-U4-U1-U2  791    -4.1
E        A        E        U1-U1-U2        493    -16.6
A        E        E        U1-U1-U2-U3     639    -12.0
B        E        E        U1-U2-U1-U3-U4  612    -7.8

Table 9. Unusual contributor sequences, counts, and Z-scores. All
sequences in tables appear significantly above or below chance at
p < 0.001 (with Bonferroni correction).

In order to explore these bursts of activity by peripheral
members, we performed a follow-on analysis seeking to identify
the event that triggered those sequences. We studied Google
searches for the keywords corresponding to articles’ titles,
finding that peaks in the usage of those keywords are often
associated with the start of E sequences, implying
that an external event (attracting the Google searches)
has triggered the burst of activity by those first-time
contributors. For example, Google searches for the Canadian actor
Cameron Bright peaked in August 2009 (possibly associated
with the anticipated release of the film “The Twilight Saga:
New Moon” where he played a leading role), and at the same
time the Wikipedia article for Cameron Bright saw a burst of
activity by first-time arrivals to this article.

Distinctive motifs 

The fourth pattern we observe in Table 9 is that newcomers to
the article (label E) are unlikely to contribute after an editing
session made by a previously-active contributor (labels A, B,
or C). The rare occurrence of A-E and C-E motifs implies
that there are activity periods where insiders and “owners” of
an article concentrate their effort. During those times,
outsiders are unlikely to join the co-authoring effort. To the best
of our knowledge, this particular motif has not been reported
in prior studies of peer production.

Case Study Discussion 

Our case study illustrates the potential of sequence analysis
to illuminate deeper patterns within online knowledge
collaborations using a mixture of qualitative and quantitative
methods [58]. We used a method to capture the relative
activity of editors, defined a set of sequential patterns,
measured their frequency within a cross-sample of Wikipedia
data, and tested the likelihood of observing patterns compared
to chance, using the Bonferroni correction to increase
confidence in the validity of the significance of the observed
results. Having identified the prevalence of many quantitative
patterns occurring significantly more than random, we
employed a qualitative analysis to iteratively code the data,
identifying different types of multi-label motifs. Grouping
motifs that represent similar sequence characteristics, we were
able to identify four distinct key patterns. Some of these
patterns were already documented in Wikipedia (e.g.,
destructive ping-pongs), others were reported for other peer-production
projects but not in Wikipedia (e.g., co-production relying
on single-contributor work sessions, as opposed to
multi-contributor teamwork), and yet others have not yet been
documented in Wikipedia or more broadly in peer-production
(e.g., sequences of productive ping-pongs).

This case study is illustrative of the potential of a sustained re¬ 
search agenda that would employ sequence analysis methods 
originally developed in sociology, biology, and other fields, 
in order to analyze complex organizational processes within 
peer production communities and online knowledge collabo¬ 
rations [25, 69]. Event logs archiving the activities, artifacts, 
performers, and order of actions are common across socio- 
technical systems, but only a handful of analyses to date have 
explored the temporal or sequential patterns within these ac¬ 
tivity sequences. We outlined a preliminary set of approaches 
for defining different relationships within sequences based on 
varying levels of analysis. Having identified sequences from 
socio-technical event log data, a range of established meth¬ 
ods from pattern mining, sequence similarity, and probabilis¬ 
tic analysis are available for quantitative modeling. 

SEQUENCE ANALYSIS RESEARCH AGENDA FOR CSCW

While sociologists have championed the role of sequence data 
for understanding questions about the prevalence of patterns 
within sequences, the independent predictors of sequence pat¬ 
terns, and the consequences of sequence patterns [1], these re¬ 
search questions have made few in-roads to human-computer 
interaction (HCI) or computer-supported cooperative work 
(CSCW). Having described our proposed framework for se¬ 
quence analysis of socio-technical trajectories and having il¬ 
lustrated its application through a case study, we now turn our 
attention to outlining a research agenda for the application of 
this framework to answer questions about the configuration, 
prevalence, and relationship of sequence data in the field of 
CSCW. We first discuss this research agenda at the method¬ 
ological level, describing various types of research objectives 
that could be addressed through sequence analysis. We then 
proceed to review common CSCW research domains and set¬ 
tings and provide examples of how our framework could be 
applied across them. 

Relationships within Sequences 

Sequences are fundamentally relational constructs that con¬ 
nect a series of activities, artifacts, and performers together. 
We primarily define these sequences by a time order, where 
elements are made to be adjacent if they occur after each 
other. Instead of generating an artifact-oriented event log 
by sub-setting Wikipedia’s entire event log by a single arti¬ 
cle’s event log as we did in Table 1, we could also subset 
Wikipedia’s event log by a single user’s contribution history. 
This user-oriented event log would reveal the sequence of ac¬ 
tions of a single user constituting their editing history over 
time and across artifacts, documenting the accumulation of 
actions that ultimately define a social identity or a role [7, 
29]. Alternatively, sub-setting on a single edit type [42] could 
reveal distinctive sub-classes of work or the evolution of rou¬ 
tines. Each of the sequences generated from these smaller 
subsets can then be used as a case and compared against 
each other in subsequent analyses. Thus a sequence of multi¬ 
dimensional events might be “projected” onto a single dimen¬ 
sion. That is, a single contribution can be a member of dif¬ 
ferent types of sequences depending on the unit of analysis 
employed at the performer (editor), artifact (article), or ac¬ 
tivity (edit type) levels, and these different sequences can be 
compared in turn. 
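
The idea of projecting one event log onto different units of analysis can be illustrated with a short sketch. The record fields (`time`, `editor`, `article`, `edit_type`) and the function name `project` are hypothetical stand-ins for whatever a given event log actually records.

```python
from collections import defaultdict

def project(event_log, dimension):
    """Group time-ordered events into one sequence per value of `dimension`."""
    sequences = defaultdict(list)
    for event in sorted(event_log, key=lambda e: e["time"]):
        sequences[event[dimension]].append(event)
    return dict(sequences)

# Illustrative event log (not data from the paper).
log = [
    {"time": 1, "editor": "U1", "article": "P1", "edit_type": "add"},
    {"time": 2, "editor": "U2", "article": "P1", "edit_type": "revert"},
    {"time": 3, "editor": "U1", "article": "P2", "edit_type": "add"},
    {"time": 4, "editor": "U1", "article": "P1", "edit_type": "reorganize"},
]

by_article = project(log, "article")    # artifact-oriented sequences
by_editor = project(log, "editor")      # performer-oriented sequences
by_type = project(log, "edit_type")     # activity-oriented sequences

editors_on_p1 = [e["editor"] for e in by_article["P1"]]
articles_by_u1 = [e["article"] for e in by_editor["U1"]]
```

The same four events yield a contributor sequence for article P1 and an article sequence for editor U1, so a single contribution appears in several sequences depending on the chosen dimension.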

Recurring Patterns across Sequences 

Questions about patterns within sequences primarily ask
whether there are typical sub-sequences. How common or
unusual is a pattern or motif? Do some actions reliably
follow others? Can we interpret a cluster of highly-similar
sequences as representing a class of behavior? Furthermore,
the potential of extracting multiple sequences from event logs
likewise invites questions about the alignments between these
different sequences, for example: are changes in activity
types coupled with changes in performers? Finally, the use
of probabilistic models allows us to analyze changes in the
likelihood of state transitions between a series of adjacent
sequences.
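
One simple probabilistic model of this kind is a first-order transition table, which estimates how likely each label is to follow another. The sketch below is an illustration under that assumption, with made-up label vectors and a hypothetical function name; it is not a method reported in the case study.

```python
from collections import Counter, defaultdict

def transition_probabilities(label_vectors):
    """Estimate P(next label | current label) from adjacent label pairs."""
    counts = defaultdict(Counter)
    for labels in label_vectors:
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }

# Illustrative label vectors for two articles (not data from the paper).
probs = transition_probabilities([list("EEAAE"), list("EDE")])
```

Estimating such a table separately for different periods or article types would be one way to compare how transition likelihoods change.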


Interpretation of Sequences 

Event log data in socio-technical systems like Wikipedia and
GitHub enables a kind of archival ethnography (or
information archaeology) by which the preceding and subsequent
actions in the past can be re-read and re-interpreted, much as
a contemporaneous user would have also interacted with the
subject [33]. Sequence analysis methods can be used as a way
of triangulating between nested approaches: “zooming out”
from simple dyadic relationships to identify macro-social
statistical patterns about similarity, and “zooming in” to
narratively interpret quantitative outliers as the consequence of
specific contexts, emergent patterns, or crucial turning points
that the quantification overlooked [58]. Qualitative analysis
is likewise crucial for identifying breaks in the chain of
reasoning and threats to validity when data extracted from
information systems are translated through analytical modeling
and transformed into the relevant analytical constructs
under study [35].

Antecedents of Sequenced Behavior 

The patterns observed in sequences are not random, but rather
reflect path-dependencies underlying social processes such as
collaboration and conflict. This class of questions can explore
how other behavioral or psychological constructs of people or
technological affordances of a technology influence the kinds
of sequences that activities, artifacts, and performers
subsequently become embedded within. For example, Wikipedia
editors engaged in anti-vandalism work are likely to have
very distinct sequential patterns in their contribution
behavior compared to editors engaged in copy-writing. The
organizational and external contexts also influence the kinds
of sequences observed: high-tempo collaborations following
crises may prompt different kinds of sequences than those
following traditional, low-tempo collaborations [39].

Consequences of Sequenced Behavior 

Sequences may influence other outcomes observed in online
knowledge collaborations. Research may explore how
certain activity patterns influence individuals’ performance (e.g.,
administrative promotions) as well as organizational
performance (e.g., the production of high-quality content).
Investigating the predictors of article quality in Wikipedia is an
active research area [5, 55] and future research could explore
the relationships between particular sequences of activity and
article quality. In addition, the organizational literature
discusses ways in which joint involvement in routines could
facilitate the formation of relationships [22]; a possible
area for future work is thus the relationship between co-creation
sequences and the creation of interpersonal relationships.

Implications for CSCW Research 

Our framing and case study emphasized the relevance of
sequence analysis methods for understanding online
knowledge collaborations such as peer-production (and in
particular, Wikipedia). However, these methods also have clear
applications to other CSCW domains. Table 10 identifies
six established and emerging CSCW sub-topics and provides
examples of research questions that could be addressed by
employing sequence analysis approaches. We illustrate each
CSCW sub-topic with three research questions, and
map each of these research questions back to one of the
research objectives described above. This table is not intended
to exhaust all of the potential sub-topics that exist within
CSCW, but rather to illustrate the potential for the proposed
approach across a variety of areas that have been at the center
of CSCW research: collaborative filtering, online social
networks, crowdsourcing, online education, crisis informatics,
and citizen science, among many other domains. The
analysis summarized in Table 10 outlines how sequence analysis
could extend our understanding of socio-technical behavior
across diverse CSCW domains.

Topic: Collaborative filtering

• How do users’ ratings of the same items vary based on the position of the items within
a sequence of users’ other ratings? (Recurring Pattern)

• What are features that predict the beginning and end of a user’s rating sessions?
(Antecedent)

• How does the accuracy of recommendations using sequence similarity compare
to traditional collaborative filtering methods? (Consequence)

Topic: Online social networks

• Are there similarities in the sequences of users’ creating and deleting relationships?
(Relationships)

• Do users’ information seeking sequences differ by demographics, skill, personality, or
relationship strength? (Antecedent)

• How do variations in the sequence of items presented in a stream or feed influence user
engagement? (Consequence)

Topic: Crowdsourcing

• Are there sequences in a worker’s early history that predict their subsequent level of
commitment or quality? (Recurring Pattern)

• Can the sequences of HITs completed by multiple workers reveal shared latent interests?
(Interpretation)

• How does HIT quality vary in relation to the task demands of prior tasks? (Consequence)

Topic: Online education

• How do specific teaching elements disrupt the flow of successfully completing course
content? (Recurring Pattern)

• What do clusters of course participation sequences reveal about students’ interests and
backgrounds? (Antecedent)

• Does the sequence of participation in prior courses influence subsequent course
engagement and performance? (Consequence)

Topic: Crisis informatics

• What are common behavioral sequences taken by users in the immediate aftermath of
an event? (Recurring Pattern)

• How do users adopt social roles and employ routines in response to the sequence
of new information about an event? (Interpretation)

• How were pre-existing routines and behaviors disrupted by an event and how long did
this disruption persist? (Consequence)

Topic: Citizen science

• To what extent do contributors’ motivations and attitudes determine their future
participation patterns? (Antecedents)

• What are the sequences of activities that predict one’s commitment, formal positions
taken, and sustained participation? (Consequences)

• What do the sequences of feedback (ratings, comments) around a reported observation
tell us about the reliability of this observation? (Interpretation)

Table 10. Examples of research questions for extending sequence analysis methods into other CSCW topics.

CONCLUSION 

Event log data in socio-technical systems like Wikipedia
encode a rich variety of content and meta-data about who is
contributing what, where, when, and how. Our framework
for analyzing behavioral sequences in event log data intends
to answer questions pertaining to the structure and dynamics
of online knowledge collaboration. Our research framework
adapts a generalizable definition of event logs, and outlines
the process of identifying, schematizing, analyzing, and
interpreting sequences of activities, artifacts, or performers that
co-occur. It is important to emphasize that interest in studying
the temporal dynamics within online collaboration is not new;
however, prior studies investigating dynamics within
peer-production tended to “collapse” and “aggregate” data, thus
losing important details regarding temporal sequences.
Sequence analysis methods are able to mitigate these concerns
and provide a richer description of the temporal dynamics
underlying socio-technical systems. We outlined a
step-by-step procedure for employing sequence analysis in a variety
of CSCW domains and demonstrated the validity of this
approach through a case study that explored the prevalence
and statistical significance of contributors’ activity sequences
in the context of Wikipedia.

This case study examining Wikipedia editors’ sequential edits 
to articles represents an implementation of the “grammatical” 
models proposed by theorists of “practice” in organizational 
studies. Much as language permits variation in form 
to convey similar meanings, organizations may support variations 
in patterns of actions to achieve similar goals. The 
prevalence (or absence) of repeated patterns of behavior in 
both structural and performative processes has been a cornerstone 
of theoretical developments in the areas of organizational 
routines and practice theory. However, empirical 
testing of these grammatical models has been limited by the 
difficulty of obtaining data from ethnographic field studies. To 
date, there is very little empirical support for these grammatical 
theories of organizational routines. Socio-technical event 
log data have the variability, breadth, and granularity to adapt 
grammatical models of organizational routines to understand 
the variation, selection, and retention of new practices and 
patterns of actions in technology-mediated social participation. 
By schematizing contributors’ activity histories relative 
to each other, we were able to identify patterns of action occurring 
significantly above (or below) chance. Findings from 
our empirical case study provide support for Feldman’s idea 
of performativity in organizational routines, whereby people 
adopt or abandon routines in response to others’ actions. 
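
To make the idea of patterns occurring above or below chance concrete, 
the following is a minimal sketch of one common way to test it, 
assuming a within-sequence shuffle as the null model. It is not 
necessarily the procedure used in the case study. The function names 
and the single-letter activity codes in the usage note are hypothetical 
and chosen only for illustration. 

```python
import random
from collections import Counter


def ngram_counts(sequences, n):
    """Count every contiguous n-gram across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def motif_z_scores(sequences, n=2, n_shuffles=200, seed=0):
    """Compare observed n-gram counts with counts under a shuffled null.

    Assumption: each sequence is shuffled independently, which keeps its
    length and symbol frequencies fixed and randomizes only the ordering.
    """
    rng = random.Random(seed)
    observed = ngram_counts(sequences, n)
    null_samples = {gram: [] for gram in observed}

    for _ in range(n_shuffles):
        shuffled = []
        for seq in sequences:
            copy = list(seq)
            rng.shuffle(copy)
            shuffled.append(copy)
        counts = ngram_counts(shuffled, n)
        for gram in observed:
            null_samples[gram].append(counts.get(gram, 0))

    scores = {}
    for gram, samples in null_samples.items():
        mean = sum(samples) / len(samples)
        var = sum((x - mean) ** 2 for x in samples) / len(samples)
        std = var ** 0.5
        # A z-score is undefined when the null distribution has no spread.
        scores[gram] = (observed[gram] - mean) / std if std > 0 else None
    return observed, scores
```

Called on a few toy sequences such as `list("ARARAR")`, where "A" and 
"R" stand for two hypothetical activity codes, a positive score for 
`("A", "R")` would indicate that the transition occurs more often than 
the shuffled baseline predicts, and a negative score less often. 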

As a methodological contribution, this research agenda is 
a call to extend methods for examining and interpreting 
the relationships, patterns, antecedents, and consequences 
of sequences in event log data across socio-technical systems. 
While underutilized within CSCW, methods for analyzing, 
aligning, and modeling sequences are well-developed 
in fields like bio-informatics, thus inviting potentially compelling 
analogies for future theoretical elaboration. Far from 
being a framework applicable only to peer production or online 
knowledge collaborations, sequence analysis methods 
have substantial potential to become a first-class method to 
answer research questions about event log data. Consider 
natural language processing or network science: both provide 
researchers with a set of tools for extracting 
features and structured information from complex, multidimensional 
data; both have long, interdisciplinary histories; 
and both have been adapted to support mixed methods research. 
We foresee a future where scholars studying online 
group processes could draw from a portfolio of techniques 
that includes sequence approaches alongside CSCW’s other 
mixed methods like content or network analysis. 

Although developed as a quantitative method, sequence analysis 
approaches have very high potential for supporting mixed 
methods inquiry in the context of event log analysis. Results 
from sequence analysis can identify common patterns or outliers, 
which invite closer scrutiny. Because sequences involve 
actors and artifacts with attributes like status or quality (respectively), 
these co-variates allow comparison of different 
kinds of data measuring the same phenomena. In addition 
to juxtaposing across distinct kinds of data, these data can 
also be modeled in relation to each other as predictors or 
consequences of other processes. In effect, sequence analysis 
methods can provide a set of second-order behavioral attributes 
that are conceptually and theoretically distinct from 
first-order attributes like counts or intensities. Notwithstanding 
the merits of methods such as content analysis of textual 
data or network analysis of dyadic interactions, we maintain 
that integrating first- and second-order behavioral features 
opens up new ways to complement other approaches for 
contextualizing, triangulating, and confirming 
socio-technical behaviors. 
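
The distinction between first- and second-order attributes can be 
illustrated with a small sketch. It assumes that first-order features 
are order-free activity counts and that second-order features are 
counts of transitions between consecutive activities; both function 
names and the two toy editor histories are hypothetical. 

```python
from collections import Counter


def first_order_features(seq):
    """Order-free counts of each activity code in a sequence."""
    return Counter(seq)


def second_order_features(seq):
    """Counts of transitions between consecutive activity codes."""
    return Counter(zip(seq, seq[1:]))


# Two hypothetical editors with identical activity counts
# but different orderings of the same activities.
editor_a = ["add", "add", "add", "revert", "revert", "revert"]
editor_b = ["add", "revert", "add", "revert", "add", "revert"]

same_counts = first_order_features(editor_a) == first_order_features(editor_b)
same_transitions = (
    second_order_features(editor_a) == second_order_features(editor_b)
)
```

Here the two histories are indistinguishable once aggregated into 
counts, yet their transition profiles differ, which is the kind of 
detail that collapsing the data would discard. 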